Abstract — The following critical phenomenon was recently discovered. When a memoryless source is compressed using a variable-length fixed-distortion code, the fastest convergence rate of the (pointwise) compression ratio to $R(D)$ is either $O(\sqrt{n})$ or $O(\log n)$. We show it is always $O(\sqrt{n})$, except for discrete, uniformly distributed sources.

Keywords — Redundancy, rate-distortion theory, lossy data 
compression 

I. Introduction 

Suppose that data is produced by a stationary memoryless source $\{X_n ; n \ge 1\}$, so that the $X_i$ are independent and identically distributed (IID) random variables with common distribution $P$. We will assume throughout that the $X_i$ take values in the source alphabet $A$, where $A$ is a subset of $\mathbb{R}$, and that the reproduction alphabet $\hat{A}$ is a finite subset of $\mathbb{R}$, say $\hat{A} = \{a_1, a_2, \ldots, a_k\}$.

The main objective of data compression is to find efficient approximate representations for data $x_1^n = (x_1, x_2, \ldots, x_n)$ generated from the source $X_1^n = (X_1, X_2, \ldots, X_n)$. Specifically, we wish to represent each source string $x_1^n$ by a corresponding string $y_1^n = (y_1, y_2, \ldots, y_n)$ taking values in the reproduction alphabet $\hat{A}$, so that the distortion between each $x_1^n$ and its representation lies within some fixed allowable range. For our purposes, distortion is measured by a family of single-letter distortion measures,

$$\rho_n(x_1^n, y_1^n) = \frac{1}{n} \sum_{i=1}^n \rho(x_i, y_i), \qquad x_1^n \in A^n,\ y_1^n \in \hat{A}^n, \tag{1}$$

where $\rho : A \times \hat{A} \to [0, \infty)$ is a fixed nonnegative function.
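As a small illustration of definition (1), the following Python sketch averages a per-letter distortion over two strings of equal length. The function names `rho_n` and `hamming` are hypothetical helpers introduced here for illustration; they do not appear in the paper.

```python
def rho_n(x, y, rho):
    """Average single-letter distortion (1/n) * sum_i rho(x_i, y_i), as in (1)."""
    if len(x) != len(y) or len(x) == 0:
        raise ValueError("x and y must be non-empty and of equal length")
    return sum(rho(a, b) for a, b in zip(x, y)) / len(x)


def hamming(a, b):
    """Hamming distortion: 0 if the letters agree, 1 otherwise."""
    return 0.0 if a == b else 1.0


# One mismatch out of four letters.
print(rho_n([0, 1, 1, 0], [0, 1, 0, 0], hamming))
```

Any nonnegative function of a source letter and a reproduction letter can be passed as `rho`, which mirrors the fact that (1) is determined entirely by the single-letter measure.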

We consider variable-length block codes operating at a fixed distortion level, that is, codes $C_n$ defined by triplets $(B_n, \phi_n, \psi_n)$ where:

(a) $B_n$ is a subset of $\hat{A}^n$ called the codebook;
(b) $\phi_n : A^n \to B_n$ is a map called the encoder;
(c) $\psi_n : B_n \to \{0,1\}^*$ is a prefix-free representation of the elements of $B_n$ by finite-length binary strings.

For a fixed distortion level $D \ge 0$, the code $C_n = (B_n, \phi_n, \psi_n)$ is said to operate at distortion level $D$ [8] if it encodes each source string with distortion $D$ or less:

$$\rho_n(x_1^n, \phi_n(x_1^n)) \le D \quad \text{for all } x_1^n \in A^n.$$
Our main quantity of interest here is the description length of a block code $C_n$, expressed by its length function $\ell_n : A^n \to \mathbb{N}$:

$$\ell_n(x_1^n) = \text{length of } [\psi_n(\phi_n(x_1^n))].$$

Broadly speaking, the smaller the description length, the better the code.

Shannon's celebrated source coding theorem states that, for an arbitrary sequence of block codes $\{C_n = (B_n, \phi_n, \psi_n) ; n \ge 1\}$ operating at distortion level $D$, the expected compression ratio $E[\ell_n(X_1^n)]/n$ is asymptotically bounded below by the rate-distortion function $R(D)$:

$$\liminf_{n \to \infty} \frac{E[\ell_n(X_1^n)]}{n} \ge R(D) \quad \text{bits per symbol.}$$

Moreover, Shannon showed that there exist codes achieving the above lower bound with equality; see Shannon's 1959 paper [11] or Berger's classic text [4]. A stronger version of Shannon's theorem was proved by Kieffer in 1991 [8], where it is shown that the rate-distortion function is a pointwise asymptotic lower bound for $\ell_n(X_1^n)$:

$$\liminf_{n \to \infty} \frac{\ell_n(X_1^n)}{n} \ge R(D) \quad \text{with prob. 1.} \tag{2}$$

In [8] it is also demonstrated that the bound in (2) can be achieved with equality.

The following refinement to Kieffer's result was recently given in [10]:

(POINTWISE REDUNDANCY): For any sequence of block codes $\{C_n\}$ with associated length functions $\{\ell_n\}$, operating at distortion level $D$,

$$\ell_n(X_1^n) \ge nR(D) + \sum_{i=1}^n f(X_i) - 2\log n \quad \text{eventually, with prob. 1,} \tag{3}$$

where $f : A \to \mathbb{R}$ is a bounded function depending on $P$ and $D$ but not on the codes $\{C_n\}$, such that $E_P[f(X_1)] = 0$. Moreover, there exist codes $\{C_n, \ell_n\}$ that achieve

$$\ell_n(X_1^n) \le nR(D) + \sum_{i=1}^n f(X_i) + 5\log n \quad \text{eventually, with prob. 1.} \tag{4}$$

[Cf. Theorems 4 and 5 and eq. (18) in [10]; above and throughout the paper, 'log' denotes the logarithm taken to base 2 and '$\log_e$' denotes the natural logarithm.] The function $f$ is defined precisely in Section III; here we just mention the following interpretation: If we write $\bar{f}(x) = f(x) + R(D)$, then $\bar{f}$ can be expressed in a natural way in terms of familiar information theoretic quantities. In particular, $E(\bar{f}(X_1)) = R(D)$, its variance $\sigma^2 = \mathrm{Var}(\bar{f}(X_1))$ is the "minimal coding variance" of the source with distribution $P$ [10], and in the case of lossless compression (as $D \downarrow 0$), $\bar{f}(x)$ reduces to $-\log P(x)$.

The above result says that, for any source distribution $P$ and any sequence of codes $\{C_n\}$ operating at distortion level $D$, the "pointwise redundancy" in the description lengths of the codes $C_n$, namely, the difference between $\ell_n(X_1^n)$ and the optimum $nR(D)$ bits, is essentially bounded below by the sum of the IID, bounded, zero-mean random variables $f(X_i)$. So there are two possibilities:

• Either the random variables $f(X_i)$ are nonconstant, in which case the best achievable pointwise redundancy rate will be of order $O(\sqrt{n})$ (by the central limit theorem and the upper and lower bounds in (3) and (4));

• or the random variables $f(X_i)$ are equal to zero with probability one, in which case the best achievable pointwise redundancy is no more than $(5\log n)$ bits, eventually (by (4)).

To be more precise, in the first case when the random variables $f(X_i)$ are not constant, the central limit theorem implies that the term $\sum_{i=1}^n f(X_i)$ is of order $O(\sqrt{n})$ in probability, and therefore, by (3) and (4), the best achievable pointwise redundancy will also be of order $O(\sqrt{n})$ in probability. [In a similar fashion, the law of the iterated logarithm implies that the pointwise fluctuations of the best achievable pointwise redundancy will be of order $O(\sqrt{n \log_e \log_e n})$; see [10, Section I] for a more detailed discussion. Also the contrast between the pointwise and the expected redundancy rate is interpreted and commented on in [10, Remark 3, p. 139].]

Our purpose in this paper is to characterize exactly when each one of the above two cases occurs, namely, when the minimal pointwise redundancy is $O(\sqrt{n})$ and when it is $O(\log n)$. In the next section we show that it is almost never the case that $f(X_1) = 0$ with probability one, so the minimal pointwise redundancy is typically of order $\sqrt{n}$. In particular, in the common case when the $X_i$ take values in a finite alphabet $A = \hat{A}$, we show (under mild conditions) that $f(X_1) = 0$ with probability one if and only if the $X_i$ are uniformly distributed.

Before stating our main results (Theorems 1, 2 and 3 in the next section) in detail, we recall the following representative examples from [9] and [10].

Example 1 (Lossless Compression) For a source $\{X_n\}$ with distribution $P$ on the finite alphabet $A$, a lossless code $C_n$ is a prefix-free map $\psi_n : A^n \to \{0,1\}^*$. [Or, to be pedantic, in our setting a lossless code is a code operating at distortion level $D = 0$ with respect to Hamming distortion.] In this case the function $f$ has the simple form

$$f(x) = -\log P(x) - H(P) \tag{5}$$

where $H(P) = E_P[-\log P(X_1)]$ is the entropy of $P$, and the lower bound (3) is simply

$$\ell_n(X_1^n) \ge nH(P) + \sum_{i=1}^n f(X_i) - 2\log n = -\log P(X_1^n) - 2\log n \quad \text{eventually, with prob. 1.} \tag{6}$$

The lower bound (6) is a well-known information-theoretic fact called Barron's lemma (see [2], [3] and the discussion in [9]). It says that the description lengths $\ell_n(X_1^n)$ of an arbitrary sequence of codes are (eventually with probability 1) bounded below by the idealized Shannon code lengths $-\log P(X_1^n)$, up to terms of order $\log n$. From (5) it is obvious that $f(X_1) = 0$ with probability one if and only if $P$ is the uniform distribution on $A$.
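The lossless form $f(x) = -\log P(x) - H(P)$ can be checked numerically. The sketch below is only an illustration: the helper names and the two example distributions are assumptions made here, not taken from the paper. It computes $f$ for a probability mass function given as a dictionary and confirms that $f$ has zero mean, and that it vanishes identically for a uniform distribution.

```python
import math


def lossless_f(P):
    """Return {x: -log2 P(x) - H(P)} for a pmf P given as a dict."""
    H = -sum(p * math.log2(p) for p in P.values())
    return {x: -math.log2(p) - H for x, p in P.items()}


def mean_under(P, f):
    """Expectation of f(X) when X has pmf P."""
    return sum(P[x] * f[x] for x in P)


uniform = {"a": 0.25, "b": 0.25, "c": 0.25, "d": 0.25}  # hypothetical example
skewed = {"a": 0.5, "b": 0.25, "c": 0.125, "d": 0.125}  # hypothetical example

for name, P in (("uniform", uniform), ("skewed", skewed)):
    f = lossless_f(P)
    print(name, f, "mean:", mean_under(P, f))
```

For the skewed distribution the values of $f$ differ across symbols, so the sum of the $f(X_i)$ fluctuates on the $\sqrt{n}$ scale; for the uniform one every term is zero.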

Example 2 (Binary Source, Hamming Distortion) This is the simplest non-trivial lossy example. Suppose $\{X_n\}$ is a binary source with Bernoulli($p$) distribution for some $p \in (0, 1/2]$. Let $A = \hat{A} = \{0,1\}$ and take $\rho$ to be the Hamming distortion measure, $\rho(x,y) = 0$ when $x = y$, and equal to 1 otherwise. For each fixed $D \in (0,p)$ it is shown in [10] that

$$f(x) = -\log\left(\frac{P(x)}{1-D}\right) - E_P\left[-\log\left(\frac{P(X_1)}{1-D}\right)\right],$$

from which it is again obvious that $f(X_1) = 0$ with probability one if and only if $p = 1/2$, i.e., if and only if $P$ is the uniform distribution on $A = \{0, 1\}$.
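The closed form above can be cross-checked against the general definition of $f$ from Section III, $f(x) = (\log e)[\lambda^* D - \log_e E_{Q^*}(e^{\lambda^*\rho(x,Y)})] - R(D)$. The sketch below assumes the standard textbook facts for a Bernoulli($p$) source under Hamming distortion, which are not stated in this paper: $R(D) = h(p) - h(D)$, the optimal reproduction distribution has $Q^*(1) = (p - D)/(1 - 2D)$, and $\lambda^* = \log_e[D/(1-D)]$. All function names are hypothetical.

```python
import math


def h2(t):
    """Binary entropy in bits."""
    return -t * math.log2(t) - (1 - t) * math.log2(1 - t)


def f_closed_form(p, D):
    """f from Example 2: centered version of -log2(P(x) / (1 - D))."""
    P = {0: 1 - p, 1: p}
    g = {x: -math.log2(P[x] / (1 - D)) for x in P}
    mean = sum(P[x] * g[x] for x in P)
    return {x: g[x] - mean for x in P}


def f_from_definition(p, D):
    """f from the general definition, using the assumed Q*, lambda*, R(D)."""
    q = (p - D) / (1 - 2 * D)      # assumed optimal Q*(1)
    lam = math.log(D / (1 - D))    # assumed lambda* < 0
    Q = {0: 1 - q, 1: q}
    R = h2(p) - h2(D)              # assumed rate-distortion function
    f = {}
    for x in (0, 1):
        m = sum(Q[y] * math.exp(lam * (x != y)) for y in (0, 1))
        f[x] = math.log2(math.e) * (lam * D - math.log(m)) - R
    return f


print(f_closed_form(0.3, 0.1))
print(f_from_definition(0.3, 0.1))
print(f_closed_form(0.5, 0.1))
```

Under these assumptions the two computations agree, and for $p = 1/2$ both return zero for each symbol, consistent with the statement of the example.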

In a third example presented in [10] it is also found that $f(X_1) = 0$ with probability one if and only if $P$ is the uniform distribution, and the natural question is raised as to whether this pattern persists in general. In the next section we answer this question by showing (in Theorem 1 and Corollary 1) that for a source distribution $P$ on a finite alphabet, $f(X_1)$ can be equal to zero with probability one for at most finitely many distortion levels $D$, unless $P$ is the uniform distribution and $\rho$ is a "permutation" distortion measure. In Theorems 2 and 3 and in Corollary 2 the continuous case is considered, and it is shown that when $P$ is a continuous distribution it essentially never happens that $f(X_1) = 0$ with probability one. Section III contains the proofs of Theorems 1, 2 and 3 and Corollaries 1 and 2.

II. Results 

Suppose that the source alphabet $A$ is an arbitrary (Borel) subset of $\mathbb{R}$, and let $P$ be a (Borel) probability measure on $\mathbb{R}$, supported on $A$ (the special cases when $P$ is purely discrete or purely continuous are considered separately below). Let $\hat{A} = \{a_1, a_2, \ldots, a_k\}$ be the finite reproduction alphabet of size $k$. Given an arbitrary, bounded, nonnegative function $\rho : A \times \hat{A} \to [0, M]$ (for some finite constant $M$), define a sequence of single-letter distortion measures $\rho_n : A^n \times \hat{A}^n \to [0, M]$ as in (1). Throughout the paper, we make the usual assumption:

$$\sup_{x \in A} \min_{y \in \hat{A}} \rho(x, y) = 0. \tag{7}$$
[See, e.g., [4, p. 26] or [5, Ch. 13, ex. 4]; if (7) is not satisfied, for example when $A$ is an interval of real numbers, $\hat{A}$ is a finite set, and $\rho(x,y) = (x - y)^2$, we may consider the distortion measure $\rho'(x,y) = \rho(x,y) - \min_{z \in \hat{A}} \rho(x,z)$ instead.] For $D \ge 0$, the rate-distortion function of a memoryless source with distribution $P$ is

$$R(D) = \inf_{(X,Y)} I(X;Y) \tag{8}$$

where the infimum is over all jointly distributed random variables $(X, Y)$ with values in $A \times \hat{A}$ such that $X$ has distribution $P$ and $E[\rho(X,Y)] \le D$; $I(X;Y)$ denotes the mutual information (in bits) between $X$ and $Y$ (see [4] for more details). Under our assumptions, the rate-distortion function $R(D)$ is a convex, nonincreasing function of $D \ge 0$, and it is finite for all $D$.

For a fixed distribution $P$ on $A$, let

$$D_{\max} = D_{\max}(P) = \min_{y \in \hat{A}} E_P[\rho(X, y)]$$

and recall that $R(D) = 0$ for $D \ge D_{\max}$ (see, e.g., Proposition 1 in Section III). In order to avoid the trivial case when $R(D)$ is identically zero we assume that $D_{\max} > 0$, and from now on we restrict our attention to the interesting range of distortion levels $D \in (0, D_{\max})$.
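For a finite source alphabet, $D_{\max}$ is simply the smallest expected distortion achievable by a single fixed reproduction letter. A minimal Python sketch follows; the function name `d_max` and the Bernoulli example are illustrative assumptions, not part of the paper.

```python
def d_max(P, rho, repro):
    """min over reproduction letters y of E_P[rho(X, y)], for a finite pmf P (dict)."""
    return min(sum(p * rho(x, y) for x, p in P.items()) for y in repro)


def hamming(a, b):
    return 0.0 if a == b else 1.0


# Bernoulli(0.3) source: always reproducing 0 costs 0.3, always reproducing 1 costs 0.7.
print(d_max({0: 0.7, 1: 0.3}, hamming, [0, 1]))
```

In this example the interesting range of distortion levels is $(0, 0.3)$, which matches the range $D \in (0, p)$ used for the binary source in Example 2.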

A. The Discrete Case: $A = \hat{A}$

We first consider the most common case where the source $\{X_n\}$ takes values in a finite alphabet $A = \hat{A} = \{a_1, a_2, \ldots, a_k\}$. Suppose that $\{X_n\}$ are IID with common distribution $P$ on $A$, and assume, without loss of generality, that $P_i = P(a_i) > 0$ for all $i = 1, \ldots, k$. Given a distortion measure $\rho$, write $\rho_{ij}$ for $\rho(a_i, a_j)$. We assume throughout this section that $\rho$ is symmetric, i.e., that $\rho_{ij} = \rho_{ji}$ for all $i, j$, and also that $\rho_{ij} = 0$ if and only if $i = j$. We call $\rho$ a permutation distortion measure if all rows of the matrix $(\rho_{ij})_{i,j=1,\ldots,k}$ are permutations of one another (which, by symmetry, is equivalent to saying that all columns are permutations of one another).
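The permutation property is easy to test mechanically: sort each row of the distortion matrix and compare. The helper below is a hypothetical illustration (exact comparison of entries is assumed to be adequate, which holds for integer or exactly repeated floating-point values); the two matrices are made-up examples.

```python
def is_permutation_distortion(rho):
    """True if every row of the square matrix rho is a permutation of the first row."""
    first = sorted(rho[0])
    return all(sorted(row) == first for row in rho)


hamming3 = [[0, 1, 1],
            [1, 0, 1],
            [1, 1, 0]]
lopsided = [[0, 1, 2],
            [1, 0, 2],
            [2, 2, 0]]

print(is_permutation_distortion(hamming3))  # rows are all rearrangements of (0, 1, 1)
print(is_permutation_distortion(lopsided))  # third row (0, 2, 2) differs from (0, 1, 2)
```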

Recall that the minimal pointwise redundancy is of order $O(\log n)$ if and only if $f(X_1) = 0$ with probability one; otherwise it is $O(\sqrt{n})$. Our first result says that the rate cannot be $O(\log n)$ for many distortion levels $D$, unless the distribution $P$ is uniform, in which case the rate is $O(\log n)$ for all distortion levels $D$.

Theorem 1:

(a) If $P$ is the uniform distribution on $A$ and $\rho$ is a permutation distortion measure, then $f(X_1) = 0$ with probability one for all $D \in (0, D_{\max})$.

(b) If $f(X_1) = 0$ with probability one for a sequence of distortion values $D_n \in (0, D_{\max})$ such that $D_n \downarrow 0$, then $P$ is the uniform distribution and $\rho$ is a permutation distortion measure, and therefore $f(X_1) = 0$ with probability one for all $D \in (0, D_{\max})$.

As we mentioned above, the rate-distortion function $R(D)$ is convex for $D \in (0, D_{\max})$. If it is strictly convex (as is usually the case; see the discussion in [4, Chapter 2]), then Theorem 1 can be strengthened to the following.

Corollary 1: Suppose $R(D)$ is strictly convex over the range $D \in (0, D_{\max})$. If $f(X_1) = 0$ with probability one for infinitely many $D \in (0, D_{\max})$ then $P$ is the uniform distribution and $\rho$ is a permutation distortion measure, and therefore $f(X_1) = 0$ with probability one for all $D \in (0, D_{\max})$.

Remark. In the examples presented in the previous section it turned out that either $f(X_1) = 0$ with probability one for all $D$, or it was never the case. But it may happen that $f(X_1) = 0$ with probability one only for a few isolated values of $D$, while $P$ is not the uniform distribution. Such an example is given after Lemma 3 in Section III-B.

B. The Continuous Case: $A = \mathbb{R}$

Here we take $A = \mathbb{R}$ and we assume that the distribution $P$ of the source has a positive density $g$ (with respect to Lebesgue measure), or, more generally, that there exists a (nonempty) open interval $I \subset \mathbb{R}$ on which $P$ has an absolutely continuous component with density $g$ such that $g(x) > 0$ for $x \in I$. Since the reproduction alphabet $\hat{A} = \{a_1, a_2, \ldots, a_k\}$ is finite, given a distortion measure $\rho$ we can write $r_j(x) = \rho(x, a_j)$ for all $1 \le j \le k$ and all $x \in A$. We assume that for all $j$ the functions $r_j$ are continuous on $I$. For convenience we also define, for $j = 0$, $r_0(x) \equiv 0$ on $I$.

Our next result gives a sufficient condition on the distortion measures $r_j$, under which the best redundancy rate in (4) can never be $O(\log n)$.

Theorem 2: If for every $\lambda < 0$ the functions

$$e^{\lambda r_j(\cdot)}, \qquad j = 0, 1, \ldots, k,$$

are linearly independent on $I$, then $f(X_1)$ cannot be equal to zero with probability one for any distortion level $D \in (0, D_{\max})$.

Next we provide a somewhat simpler set of conditions, under which we get a weaker conclusion. Theorem 3 says that the best redundancy rate in (4) cannot be $O(\log n)$ for many distortion levels $D$.

Theorem 3: Under either one of the following two conditions, $f(X_1)$ cannot be equal to zero with probability one for distortion levels $D > 0$ arbitrarily close to zero.

(a) There exist (distinct) points $\{x_0, x_1, \ldots, x_k\}$ in $I$ such that, for all $0 \le i \ne j \le k$, with $j \ne 0$, we have

$$r_j(x_j) > r_j(x_i).$$

(b) There exist (distinct) points $\{x_0, x_1, \ldots, x_k\}$ in $I$ such that, for every permutation $\pi$ of the indices $\{0, 1, \ldots, k\}$ with $\pi$ not equal to the identity, we have

$$\sum_{j=0}^k r_j(x_j) > \sum_{j=0}^k r_j(x_{\pi(j)}).$$

Although the conditions of Theorems 2 and 3 may seem unusual, they are natural and generally easy to verify. To illustrate this, we present below two simple examples.

Example 3 (Mean-Squared Error) Suppose $P$ has a positive density on the interval $I = [-2, 2]$, let $\hat{A}$ consist of the two reproduction points $\pm 1$, and let $\rho$ be the mean-squared error distortion measure. Recall that, to satisfy (7), $\rho(x, y)$ is actually defined by

$$\rho(x,y) = (x - y)^2 - \min\{(x-1)^2, (x+1)^2\}.$$

The corresponding distortion functions $r_1(x) = \rho(x, -1)$ and $r_2(x) = \rho(x, +1)$ are shown in Figure 1. Here, condition (a) of Theorem 3 is easily seen to hold with $x_0 = 0$, $x_1 = 2$ and $x_2 = -2$.
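The claim in Example 3 can be verified by direct evaluation. The sketch below is an illustration with hypothetical helper names; it reads condition (a) of Theorem 3 as $r_j(x_j) > r_j(x_i)$ for all $i \ne j$ with $j \ne 0$, and condition (b) as the identity permutation giving a strictly larger sum $\sum_j r_j(x_{\pi(j)})$ than any other permutation. It evaluates both for the points $x_0 = 0$, $x_1 = 2$, $x_2 = -2$.

```python
from itertools import permutations


def rho(x, y):
    """Squared error, shifted so that the best reproduction point has zero distortion."""
    return (x - y) ** 2 - min((x - 1) ** 2, (x + 1) ** 2)


# r_0 is identically zero; r_1 and r_2 correspond to reproduction points -1 and +1.
R_FUNCS = [lambda x: 0.0, lambda x: rho(x, -1.0), lambda x: rho(x, 1.0)]
POINTS = [0.0, 2.0, -2.0]  # x_0, x_1, x_2 from Example 3


def condition_a(r, pts):
    k = len(pts) - 1
    return all(r[j](pts[j]) > r[j](pts[i])
               for j in range(1, k + 1) for i in range(k + 1) if i != j)


def condition_b(r, pts):
    idx = list(range(len(pts)))
    s_identity = sum(r[j](pts[j]) for j in idx)
    return all(s_identity > sum(r[j](pts[p[j]]) for j in idx)
               for p in permutations(idx) if list(p) != idx)


print(condition_a(R_FUNCS, POINTS), condition_b(R_FUNCS, POINTS))
```

Both checks succeed for these points, in line with the observation made in the proofs that condition (a) implies condition (b).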






Fig. 1. Distortion measure in Example 3. Reproduction points are shown as x's.



Example 4 ($L^1$ Distance) Suppose $P$ has a positive density on the interval $I = [0, 6]$, let $\hat{A} = \{1, 3, 5\}$, and take $\rho$ to be the normalized $L^1$ distance $|x - y|$ adjusted so that (7) is satisfied; the resulting functions $r_j(\cdot)$ are shown in Figure 2. Here it is easy to verify that the condition of Theorem 2 is satisfied, i.e., that the functions $\{e^{\lambda r_j(x)} ; 0 \le j \le 3\}$ are linearly independent on $I$. For this it suffices to observe that $e^{\lambda r_1}$ and $e^{\lambda r_3}$ are linearly independent on $[2, 4]$ (essentially because the functions $e^{\lambda x}$ and $e^{-\lambda x}$ are linearly independent on $[0, 2]$), and that $e^{\lambda r_2}$ is not constant outside $[2, 4]$.
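A quick numerical sanity check of the linear independence in Example 4 is to evaluate the four functions $e^{\lambda r_j}$ at four points of $I$ and verify that the resulting $4 \times 4$ matrix is nonsingular; a nonzero determinant at any choice of points is sufficient for independence. The sketch below is only illustrative: it assumes $r_j(x) = |x - a_j| - \min_z |x - a_z|$ without any further normalization, fixes $\lambda = -1$, and uses sample points chosen here (deliberately not symmetric about 3, since a symmetric choice makes the determinant vanish by symmetry and is therefore inconclusive).

```python
import math

REPRO = (1.0, 3.0, 5.0)  # reproduction points a_1, a_2, a_3


def r(j, x):
    """r_0 = 0; for j >= 1, L1 distance to a_j minus distance to the nearest point."""
    if j == 0:
        return 0.0
    return abs(x - REPRO[j - 1]) - min(abs(x - a) for a in REPRO)


def det(matrix):
    """Determinant by Gaussian elimination with partial pivoting."""
    m = [row[:] for row in matrix]
    n, d = len(m), 1.0
    for c in range(n):
        p = max(range(c, n), key=lambda i: abs(m[i][c]))
        if m[p][c] == 0.0:
            return 0.0
        if p != c:
            m[c], m[p] = m[p], m[c]
            d = -d
        d *= m[c][c]
        for i in range(c + 1, n):
            factor = m[i][c] / m[c][c]
            for k in range(c, n):
                m[i][k] -= factor * m[c][k]
    return d


def sample_matrix(lam, pts):
    """Rows index sample points, columns index the functions exp(lam * r_j)."""
    return [[math.exp(lam * r(j, x)) for j in range(4)] for x in pts]


print(det(sample_matrix(-1.0, [1.5, 2.25, 2.75, 3.5])))
```

This checks a single value of $\lambda$ only; Theorem 2 requires independence for every $\lambda < 0$, which is what the analytical argument in the example establishes.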

As in the discrete case, under some additional assumptions on the rate-distortion function $R(D)$, it is possible to get a stronger version of Theorem 3:

Corollary 2: Suppose $R(D)$ is differentiable and strictly convex on $(0, D_{\max})$. Under either one of the assumptions (a) and (b) in Theorem 3, there can be at most finitely many $D \in (0, D_{\max})$ such that $f(X_1) = 0$ with probability one.

Remark. Under somewhat more restrictive assumptions on the distortion measure $\rho$, it is possible to prove that, for any $P$ with a continuous component as above, there can be at most $k(k+1)/2$ distortion levels $D$ for which $f(X_1) = 0$ with probability one. Since the proof of this slightly stronger result relies on an argument different from the ones used to prove Theorems 2 and 3, we omit it here.

Fig. 2. Distortion measure in Example 4. Reproduction points are shown as x's.



III. Proofs

A. Preliminaries

Before giving the proofs of Theorems 1, 2 and 3, we recall some definitions and notation from [10] and give the precise form of the function $f$ (see equation (12) below).

Let $P$ be a source distribution on $A$, and let $Q$ be an arbitrary probability mass function on $\hat{A}$. Write $X$ for a random variable with distribution $P$ on $A$, and $Y$ for an independent random variable with distribution $Q$ on $\hat{A}$. Let $S = \{a \in \hat{A} : Q(a) > 0\}$ be the support of $Q$ and define

$$D_{\min}^{P,Q} = E_P\Big[\min_{a \in S} \rho(X, a)\Big], \qquad D_{\max}^{P,Q} = E_{P \times Q}[\rho(X,Y)].$$

For $\lambda \le 0$, let

$$\Lambda_{P,Q}(\lambda) = E_P\big[\log_e E_Q\big(e^{\lambda\rho(X,Y)}\big)\big]$$

and for $D \ge 0$ write $\Lambda^*_{P,Q}$ for the Fenchel-Legendre transform of $\Lambda_{P,Q}$,

$$\Lambda^*_{P,Q}(D) = \sup_{\lambda \le 0}\,[\lambda D - \Lambda_{P,Q}(\lambda)].$$

We also define

$$R(P,Q,D) = \inf_{(X,Z)}\,[I(X;Z) + H(Q_Z \| Q)]$$

where $H(R\|Q) = \sum_{a \in \hat{A}} R(a)\log[R(a)/Q(a)]$ denotes the relative entropy (in bits) between $R$ and $Q$, $Q_Z$ denotes the distribution of $Z$, and the infimum is over all jointly distributed random variables $(X, Z)$ with values in $A \times \hat{A}$ such that $X$ has distribution $P$ and $E[\rho(X,Z)] \le D$. In view of (8), we clearly have

$$R(D) = \inf_{\text{all } Q} R(P,Q,D). \tag{9}$$

In Lemma 1 and Proposition 1 below we summarize some useful properties of $\Lambda_{P,Q}$, $\Lambda^*_{P,Q}$ and $R(P,Q,D)$ (see Lemma 1 and Propositions 1 and 2 in [10]).

Lemma 1:

(i) $\Lambda_{P,Q}$ is infinitely differentiable on $(-\infty, 0)$, and $\Lambda''_{P,Q}(\lambda) \ge 0$ for all $\lambda < 0$.

(ii) If $D \in (D_{\min}^{P,Q}, D_{\max}^{P,Q})$ then there exists a unique $\lambda < 0$ such that $\Lambda'_{P,Q}(\lambda) = D$ and $\Lambda^*_{P,Q}(D) = \lambda D - \Lambda_{P,Q}(\lambda)$.

Proposition 1:

(i) For all $D \ge 0$,

$$R(P,Q,D) = \inf_W E_P\big[H(W(\cdot|X)\,\|\,Q(\cdot))\big],$$

where the infimum is over all probability measures $W$ on $A \times \hat{A}$ such that the $A$-marginal of $W$ equals $P$ and $E_W[\rho(X,Y)] \le D$.

(ii) For all $D \ge 0$, $R(P,Q,D) = (\log e)\,\Lambda^*_{P,Q}(D)$.

(iii) For $0 < D < D_{\max}$ we have $0 < R(D) < \infty$, whereas for $D \ge D_{\max}$, $R(D) = 0$.

(iv) For every $D \in (0, D_{\max})$ there exists a $Q = Q^*$ on $\hat{A}$ achieving the infimum in (9), and $D \in (D_{\min}^{P,Q^*}, D_{\max}^{P,Q^*})$.

For any distribution $P$ on $A$ and any distortion level $D \in (0, D_{\max}(P))$, by Proposition 1 we can pick a $Q^*$ achieving the infimum in (9) so that $R(D) = R(P,Q^*,D)$ and also $D \in (D_{\min}^{P,Q^*}, D_{\max}^{P,Q^*})$, so by Lemma 1 we can pick a $\lambda^* < 0$ with

$$\lambda^* D - \Lambda_{P,Q^*}(\lambda^*) = \Lambda^*_{P,Q^*}(D) = (\log_e 2)\,R(P,Q^*,D) = (\log_e 2)\,R(D). \tag{10}$$

Note also that

$$\lambda^* \to -\infty \quad \text{as } D \downarrow 0 \tag{11}$$

(see the Appendix for a short proof). Finally we can define the function $f$, for $x \in A$,

$$f(x) = (\log e)\big[\lambda^* D - \log_e E_{Q^*}\big(e^{\lambda^*\rho(x,Y)}\big)\big] - R(D). \tag{12}$$

Since $E_P[f(X_1)] = 0$, $f(X_1) = 0$ with probability one if and only if

$$\sum_{j=1}^k Q^*(a_j)\,e^{\lambda^*\rho(x,a_j)} = \text{Constant, for } P\text{-almost all } x. \tag{13}$$

Next we give a useful interpretation for the constant $\lambda^*$ in the representation of $R(D)$ in (10): If $R(D)$ is differentiable at $D$, then $\lambda^*$ is proportional to its slope at $D$; Lemma 2 is proved in the Appendix.



Lemma 2: For any $D \in (0, D_{\max})$:

(i) We have $(\log_e 2)R(D) = \sup_{\lambda \le 0}\,[\lambda D - \Gamma(\lambda)]$, where $\Gamma(\lambda) = \sup_Q \Lambda_{P,Q}(\lambda)$.

(ii) Let $\lambda^*$ be chosen as in (10). If $R(\cdot)$ is differentiable at $D$, then $\lambda^* = (\log_e 2)R'(D)$.

B. Proofs in the Discrete Case 

For the proof of Theorem 1 we will need the following lemma. It easily follows from Theorem 3.7 in Chapter 2 of [6] (see the Appendix). Recall the notation $P_i = P(a_i)$ and $\rho_{ij} = \rho(a_i, a_j)$.

Lemma 3: A probability mass function $Q^*$ on $\hat{A}$ achieves the infimum in (9) if and only if there exists a $\lambda^* < 0$ such that the following all hold:

(a) $\Lambda'_{P,Q^*}(\lambda^*) = D$.

(b) If we define, for $a_i, a_j \in A$,

$$W^*(a_i, a_j) = P_i\,Q^*(a_j)\,\frac{e^{\lambda^*\rho_{ij}}}{\sum_{j'} Q^*(a_{j'})\,e^{\lambda^*\rho_{ij'}}},$$

then the second marginal of $W^*$ is $Q^*$.

(c) If $Q^*(a_j) = 0$ for some $j$, then

$$\sum_i P_i\,\frac{e^{\lambda^*\rho_{ij}}}{\sum_{j'} Q^*(a_{j'})\,e^{\lambda^*\rho_{ij'}}} \le 1.$$




Example 5: Here we present a simple example illustrating the fact that it may happen that $f(X_1) = 0$ for a few isolated values $D$ even when $P$ is not uniform. Take $A = \hat{A} = \{0, 1, 2\}$, let $\alpha = \log_e[3e/(4 - e)]$, and consider the distortion measure

$$(\rho_{ij}) = \begin{pmatrix} 0 & 1 & \alpha \\ 1 & 0 & \alpha \\ \alpha & \alpha & 0 \end{pmatrix}.$$

Then with $P = Q^* = (4/13, 4/13, 5/13)$ and $\lambda^* = -1$, it is straightforward to check that condition (b) of Lemma 3 holds (condition (c) is irrelevant here), and also (13) is satisfied. Therefore, at $D = \Lambda'_{P,Q^*}(\lambda^*) \approx 0.43$, we must have $f(X_1) = 0$ with probability one. [Note, also, that the distortion measure used here is not a permutation distortion measure.]
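The numbers in Example 5 can be checked by direct computation. The sketch below is hedged: it assumes that the distortion matrix has rows $(0, 1, \alpha)$, $(1, 0, \alpha)$, $(\alpha, \alpha, 0)$, which is a matrix consistent with the stated values of $\alpha$, $P = Q^*$, $\lambda^*$ and $D \approx 0.43$ rather than one copied from a legible source. It evaluates the constancy condition (13), the second marginal from Lemma 3 (b), and the derivative $\Lambda'_{P,Q^*}(\lambda^*)$, using the standard expression $E_P[E_Q(\rho e^{\lambda\rho})/E_Q(e^{\lambda\rho})]$ for the derivative.

```python
import math

alpha = math.log(3 * math.e / (4 - math.e))
RHO = [[0.0, 1.0, alpha],      # assumed distortion matrix
       [1.0, 0.0, alpha],
       [alpha, alpha, 0.0]]
P = Q = [4 / 13, 4 / 13, 5 / 13]
LAM = -1.0
K = range(3)

# Left-hand side of (13) for each source letter i: sum_j Q(a_j) exp(lam * rho_ij).
row_sums = [sum(Q[j] * math.exp(LAM * RHO[i][j]) for j in K) for i in K]

# Second marginal of W* from Lemma 3 (b).
marginal = [sum(P[i] * Q[j] * math.exp(LAM * RHO[i][j]) / row_sums[i] for i in K)
            for j in K]

# D = Lambda'(lam) = E_P[ E_Q(rho * exp(lam * rho)) / E_Q(exp(lam * rho)) ].
D = sum(P[i] * sum(Q[j] * RHO[i][j] * math.exp(LAM * RHO[i][j]) for j in K) / row_sums[i]
        for i in K)

print(row_sums)   # the three entries should coincide if (13) holds
print(marginal)   # should reproduce Q*
print(D)          # should be close to the value 0.43 quoted in the example
```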

Proof of Theorem 1, (a): Suppose $\rho$ is a permutation distortion measure and $P$ is the uniform distribution on $A$, $P_i = 1/k$ for all $i = 1, \ldots, k$. First we claim that for any $D \in (0, D_{\max})$ we can take $Q^*$ to also be uniform. With $Q^*(a_j) = 1/k$ for all $j$, it suffices to find $\lambda^* < 0$ satisfying (a) and (b) of Lemma 3 (part (c) is irrelevant here). We have $D_{\min}^{P,Q^*} = 0$ and

$$D_{\max}^{P,Q^*} = \sum_{i,j} \frac{1}{k}\,\frac{1}{k}\,\rho_{ij} = \frac{1}{k}\,S,$$

where $S = \sum_i \rho_{ij}$, which is independent of $j$ (since $\rho$ is a permutation). Also by the permutation property, $D_{\max} = \min_j E_P[\rho(X, a_j)] = \min_j \sum_i (1/k)\rho_{ij} = (1/k)S$. Choose and fix a $D \in (0, D_{\max})$, and pick $\lambda^* < 0$ as in (10) so that Lemma 3 (a) holds. With this $\lambda^*$ and $Q^*$ being uniform let $W^*$ be as in Lemma 3 (b); then

$$\sum_i W^*(a_i, a_j) = \sum_i \frac{1}{k}\,\frac{1}{k}\,\frac{e^{\lambda^*\rho_{ij}}}{\sum_{j'} \frac{1}{k}\,e^{\lambda^*\rho_{ij'}}} = \frac{1}{k}\,\sum_i \frac{e^{\lambda^*\rho_{ij}}}{\sum_{j'} e^{\lambda^*\rho_{ij'}}}.$$

But the sum in the denominator above,

$$\sum_{j'} e^{\lambda^*\rho_{ij'}} \quad \text{is independent of } i \tag{14}$$

because $\rho$ is a permutation. By the symmetry of $\rho$, the sum $\sum_i e^{\lambda^*\rho_{ij}}$ over a column equals this same constant, so $\sum_i W^*(a_i, a_j) = 1/k = Q^*(a_j)$, and (b) is satisfied. This proves that we can take $Q^*$ to be uniform. Now simply multiplying (14) by $1/k$ we obtain (13), and this implies that $f(x) = 0$ for all $x \in A$. Since $D \in (0, D_{\max})$ was arbitrary, we are done. □
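The two facts used in this proof, namely that the normalizing sums do not depend on $i$ and that the second marginal of $W^*$ is again uniform, can be observed numerically on a small example. The sketch below uses a hypothetical $4 \times 4$ symmetric distortion matrix whose rows are cyclic shifts of $(0, 1, 2, 1)$, together with an arbitrary $\lambda < 0$; neither is taken from the paper.

```python
import math


def uniform_check(rho, lam):
    """For uniform P and Q, return the sums in (13) and the second marginal of W*."""
    k = len(rho)
    denom = [sum(math.exp(lam * rho[i][j]) / k for j in range(k)) for i in range(k)]
    marginal = [sum((1 / k) * (1 / k) * math.exp(lam * rho[i][j]) / denom[i]
                    for i in range(k))
                for j in range(k)]
    return denom, marginal


# Hypothetical permutation distortion measure: symmetric, zero exactly on the diagonal.
RHO = [[0, 1, 2, 1],
       [1, 0, 1, 2],
       [2, 1, 0, 1],
       [1, 2, 1, 0]]

denom, marginal = uniform_check(RHO, lam=-0.7)
print(denom)     # identical entries: condition (13) holds
print(marginal)  # every entry equals 1/4: Lemma 3 (b) holds
```

The same check passes for any $\lambda < 0$, reflecting the fact that the conclusion of part (a) holds for every $D \in (0, D_{\max})$.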

Proof of Theorem 1, (b): Let $D_n$, $n \ge 1$, be a sequence of distortion values in $(0, D_{\max})$ for which $f(X_1) = 0$ with probability one, and such that $D_n \downarrow 0$. For each $D_n$, we can pick $Q_n$ and a $\lambda_n < 0$ as in (10) such that $R(D_n) = (\log e)\Lambda^*_{P,Q_n}(D_n)$ and $D_n = \Lambda'_{P,Q_n}(\lambda_n)$. Let

$$\bar{D} = \big(\min_i P_i\big)\big(\min_{i \ne j} \rho_{ij}\big) > 0.$$

Then for all $n$ large enough so that $D_n < \bar{D}$, we must have $Q_n(a_i) > 0$ for all $i$ (otherwise it is trivial to check that $\Lambda'_{P,Q_n}(\lambda) \ge \bar{D}$ for any $\lambda < 0$, contradicting the choice of $\lambda_n$). From now on we restrict attention to these large enough $n$'s. As discussed above, $f(X_1) = 0$ with probability one if and only if condition (13) holds, which, in this case, becomes

$$\sum_{j=1}^k Q_n(a_j)\,e^{\lambda_n\rho_{ij}} \quad \text{is independent of } i. \tag{15}$$

By Lemma 3 (b) we have that for all $j$

$$\sum_i P_i\,\frac{e^{\lambda_n\rho_{ij}}}{\sum_{j'} Q_n(a_{j'})\,e^{\lambda_n\rho_{ij'}}} = 1,$$

but by (15) the denominator is independent of $i$ so

$$\sum_i P_i\,e^{\lambda_n\rho_{ij}} = c_n \quad \text{is independent of } j. \tag{16}$$

By (11), $\lambda_n \to -\infty$ as $n \to \infty$, so letting $n \to \infty$ yields $P_j = \lim_n c_n$ for all $j$, so $P$ is the uniform distribution (recall our assumption that $\rho_{ij} = 0$ if and only if $i = j$). Moreover, from (16) it follows that

$$\sum_i e^{\lambda_n\rho_{ij}} \quad \text{is independent of } j. \tag{17}$$

To show that $\rho$ is a permutation, fix two arbitrary indices $j \ne j'$ and reorder the vectors $(\rho_{1j}, \ldots, \rho_{kj})$ and $(\rho_{1j'}, \ldots, \rho_{kj'})$ so that their elements are nondecreasing. Let $(\sigma_1, \ldots, \sigma_k)$ and $(\sigma'_1, \ldots, \sigma'_k)$ be the corresponding ordered vectors. Then $\sigma_1 = \sigma'_1 = 0$ and (17) implies that

$$\sum_{i=2}^k e^{\lambda_n(\sigma_i - \sigma'_2)} = \sum_{i=2}^k e^{\lambda_n(\sigma'_i - \sigma'_2)}.$$

Next we show that if $\sigma_2 \ne \sigma'_2$, say $\sigma_2 > \sigma'_2$, we get a contradiction. Since $\sigma_i - \sigma'_2 > 0$ for all $i \ge 2$, the left-hand side above tends to 0 as $n \to \infty$, but the right-hand side is $\ge 1$. Therefore $\sigma_2 = \sigma'_2$. Continuing inductively, $\sigma_i = \sigma'_i$ for all $i$, so $(\rho_{1j}, \ldots, \rho_{kj})$ and $(\rho_{1j'}, \ldots, \rho_{kj'})$ are permutations of one another. Since $j$ and $j'$ were arbitrary, this completes the proof. □

Proof of Corollary 1: As before, let $D_n$, $n \ge 1$, be a sequence of distortion values in $(0, D_{\max})$ for which $f(X_1) = 0$ with probability one, and let $Q_n$ and $\lambda_n < 0$ be chosen such that $R(D_n) = (\log e)\Lambda^*_{P,Q_n}(D_n)$. Since $R(D)$ is differentiable on $(0, D_{\max})$ (see [4, Theorem 2.5.1]), from Lemma 2 we get that $\lambda_n = (\log_e 2)R'(D_n)$. Moreover, since we assume that $R(D)$ is strictly convex on $(0, D_{\max})$, the $\lambda_n$ are all distinct.

If the sequence $\{\lambda_n\}$ is unbounded, i.e., it has a subsequence that tends to $-\infty$, then we can proceed exactly as in the proof of Theorem 1. So assume that the sequence $\{\lambda_n\}$ is bounded. Since for each $n$, $R(P,Q_n,D_n) = R(D_n) > 0$, there must be a subset $S$ of $\{1, 2, \ldots, k\}$ of size $N$, say, with $N = |S| \ge 2$, such that infinitely many of the $Q_n$ are supported on $\{a_j : j \in S\}$. Without loss of generality we can relabel the elements of $\hat{A}$ so that $S = \{1, 2, \ldots, N\}$. If $N = k$ then we can again repeat the argument in the proof of Theorem 1.

Assuming $N \le k - 1$, we proceed to get a contradiction. Since $f(X_1) = 0$ with probability one, condition (13) implies that

$$\sum_{j=1}^k Q_n(a_j)\,e^{\lambda_n\rho_{ij}} = \sum_{j=1}^N Q_n(a_j)\,e^{\lambda_n\rho_{ij}} = c_n \quad \text{for all } i.$$

Defining $\rho_{i0} = 0$ for all $i$, and letting $T(\lambda)$ denote the $(N+1) \times (N+1)$ matrix with entries $\exp(\lambda\rho_{ij})$ for $i = 1, 2, \ldots, N+1$ and $j = 0, 1, \ldots, N$, the above conditions imply that

$$T(\lambda_n)\,\big(-c_n, Q_n(a_1), \ldots, Q_n(a_N)\big)' = 0 \in \mathbb{R}^{N+1}.$$

Therefore $\det(T(\lambda_n)) = 0$ for all $n$. The sequence $\{\lambda_n\}$ is bounded so it must have an accumulation point, and since $\det(T(\lambda))$ is an analytic function of $\lambda$ it can only have isolated zeroes unless it is identically zero (see, e.g., the discussion in [1, Section 4.3.2]). So here we must have that $\det(T(\lambda)) = 0$ for all $\lambda < 0$. But as $\lambda \to -\infty$, $T(\lambda)$ converges to the matrix

$$T = \begin{pmatrix} 1 & & & \\ \vdots & & I_N & \\ 1 & & & \\ 1 & 0 & \cdots & 0 \end{pmatrix}$$

which has determinant equal to 1 or $-1$ ($I_N$ denotes the $N \times N$ identity matrix), and this provides the desired contradiction. □

C. Proofs in the Continuous Case 

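Proof of Theorem 2: We argue by contradiction. Suppose $f(X_1) = 0$ with probability one for some $D \in (0, D_{\max})$. Choose a $Q^*$ and a $\lambda^* < 0$ as in (10). Then (13) implies that

$$\sum_{j=1}^k Q^*(a_j)\,e^{\lambda^* r_j(x)} = \text{Constant, for } P\text{-almost all } x,$$

but since $P$ has an absolutely continuous component with positive density on $I$, and since the functions $r_j(\cdot)$ are assumed to be continuous, this holds for all $x \in I$. Because $r_0 \equiv 0$, the constant on the right-hand side is a multiple of $e^{\lambda^* r_0(\cdot)}$, so this is a nontrivial linear relation among the functions $e^{\lambda^* r_j(\cdot)}$, $0 \le j \le k$, on $I$, and it therefore contradicts the linear independence assumption of Theorem 2. □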

Proof of Theorem 3: First we observe that condition (a) immediately implies condition (b). Therefore it suffices to show that if condition (b) holds, $f(X_1)$ cannot be equal to zero with probability one for distortion levels $D > 0$ arbitrarily close to zero. We proceed as in the proof of Corollary 1. Assuming that there is a sequence $D_n$, $n \ge 1$, of distortion values in $(0, D_{\max})$ for which $f(X_1) = 0$ with probability one, and such that $D_n \downarrow 0$, we will derive a contradiction. Pick $Q_n$ and $\lambda_n < 0$ such that $R(D_n) = (\log e)\Lambda^*_{P,Q_n}(D_n)$. By (13),

$$\sum_{j=1}^k Q_n(a_j)\,e^{\lambda_n r_j(x)} = c_n, \quad \text{for } P\text{-almost all } x \in I. \tag{18}$$

Since $P$ has an absolutely continuous component with positive density on $I$, and since the functions $r_j(\cdot)$ are assumed to be continuous, (18) holds for all $x \in I$. In particular, for the points $x_0, \ldots, x_k$ in condition (b), (18) becomes

$$T(\lambda_n)\,\big(-c_n, Q_n(a_1), \ldots, Q_n(a_k)\big)' = 0 \in \mathbb{R}^{k+1},$$

where $T(\lambda)$ is the $(k+1) \times (k+1)$ matrix with entries $\exp(\lambda r_j(x_i))$, $0 \le i, j \le k$, and $v'$ denotes the transpose of a vector $v$. Therefore, since the entries of the vector $(Q_n(a_1), \ldots, Q_n(a_k))$ sum to 1, it follows that $\det(T(\lambda_n)) = 0$ for all $n$, or, equivalently,

$$\det(T(\lambda_n)) = \sum_\pi (-1)^{\mathrm{sign}(\pi)}\,e^{\lambda_n \sum_{j=0}^k r_j(x_{\pi(j)})} = \sum_\pi (-1)^{\mathrm{sign}(\pi)}\,e^{\lambda_n s_\pi} = 0, \tag{19}$$

where the sums are taken over all permutations $\pi$ of the set $\{0, 1, \ldots, k\}$, and the constants $s_\pi$ are given by $\sum_{j=0}^k r_j(x_{\pi(j)})$. Therefore, for any real number $s \ge 0$, we must have that

$$\sum_{\pi\,:\,s_\pi = s} (-1)^{\mathrm{sign}(\pi)} = 0. \tag{20}$$

To see this, let $\{s(1), s(2), \ldots\}$ be the (finite) increasing sequence of all possible values for the constants $s_\pi$. Then (19) implies that

$$\sum_{\pi\,:\,s_\pi = s(1)} (-1)^{\mathrm{sign}(\pi)}\,e^{\lambda_n s(1)} + \sum_{\pi\,:\,s_\pi > s(1)} (-1)^{\mathrm{sign}(\pi)}\,e^{\lambda_n s_\pi} = 0.$$

By (11), $\lambda_n \to -\infty$ as $n \to \infty$, so multiplying both sides by $e^{-\lambda_n s(1)}$ and letting $n \to \infty$ yields (20) with $s = s(1)$. Continuing this way with $s(2)$, then $s(3)$ and so on, proves (20) for all $s$.

But now notice that condition (b) implies that, if $\pi^*$ denotes the identity permutation, then $s_\pi \ne s_{\pi^*}$ for all other permutations $\pi$. Therefore, taking $s = s_{\pi^*}$ in (20), the sum on the left-hand side consists of a single nonzero term, and we get the desired contradiction. □

Proof of Corollary 2: Let $D_n$, $n \ge 1$, be a sequence of distortion values in $(0, D_{\max})$ for which $f(X_1) = 0$ with probability one, and pick $Q_n$ and $\lambda_n < 0$ as in the proof of Theorem 3. If the sequence $\{\lambda_n\}$ is unbounded, we can repeat the exact same proof as for Theorem 3. So assume that $\{\lambda_n\}$ is bounded. Since we also assume that $R(D)$ is differentiable and strictly convex, it follows from Lemma 2 that the $\lambda_n = (\log_e 2)R'(D_n)$ are all distinct. Proceeding as in the proof of Theorem 3, we get that $\det(T(\lambda)) = 0$ for all $\lambda = \lambda_n$. The sequence $\{\lambda_n\}$ is bounded so it must have an accumulation point, and $\det(T(\lambda))$ is an analytic function of $\lambda$. Therefore, arguing as in the proof of Corollary 1, $\det(T(\lambda)) = 0$ for all $\lambda < 0$. So we can find a sequence $\lambda'_n \to -\infty$ for which $\det(T(\lambda'_n)) = 0$. With $\lambda'_n$ in place of $\lambda_n$, the argument proceeds exactly as in the proof of Theorem 3. □
Appendix

Proof of (11): Suppose (11) is false. Then it is possible to pick a constant $K < \infty$ and a sequence of $D_n \in (0, D_{\max})$ with corresponding $\lambda^*_n < 0$, such that $D_n \to 0$ as $n \to \infty$ but $\lambda^*_n \ge -K$ for all $n$. Let $Q^*_n$ achieve (9) with $D = D_n$, so that

$$\Lambda'_{P,Q^*_n}(\lambda^*_n) = D_n. \tag{21}$$

For each $n$, recalling that $\rho(x,y) \le M$ for all $x, y$,

$$\Lambda'_{P,Q^*_n}(\lambda^*_n) = E_P\left[\frac{E_{Q^*_n}\big(\rho(X,Y)\,e^{\lambda^*_n\rho(X,Y)}\big)}{E_{Q^*_n}\big(e^{\lambda^*_n\rho(X,Y)}\big)}\right] \ge E_P\Big[E_{Q^*_n}\big(\rho(X,Y)\,e^{-K\rho(X,Y)}\big)\Big] \ge E_{Q^*_n}\Big[E_P\big(\rho(X,Y)\,e^{-KM}\big)\Big] \ge e^{-KM} D_{\max},$$

which is bounded away from zero. Since the $D_n \downarrow 0$, this contradicts (21). □

Proof of Lemma 2: Part (i) immediately follows from the minimax representation in [10, Lemma 2]. For (ii) note that, since $\Lambda_{P,Q}(\lambda)$ is continuous and convex in $\lambda$ (Lemma 1), $\Gamma(\lambda)$ is lower semicontinuous and convex. Then by convex duality (see, e.g., Lemma 4.5.8 in [7]), it follows that $\Gamma(\lambda) = \sup_{x \ge 0}\,[\lambda x - (\log_e 2)R(x)]$. For $D \in (0, D_{\max})$ and $\lambda^*$ as in (10), we have

$$\Gamma(\lambda^*) = \lambda^* D - (\log_e 2)R(D) = \sup_{x \ge 0}\,[\lambda^* x - (\log_e 2)R(x)].$$

But since $R(\cdot)$ is convex and (by assumption) differentiable at $D$, it must be that the derivative of $[\lambda^* x - (\log_e 2)R(x)]$ vanishes at $x = D$, i.e., $\lambda^* = (\log_e 2)R'(D)$. □

Proof of Lemma 3: First suppose that for some $\lambda^* < 0$, (a), (b) and (c) all hold. For $i = 1, \ldots, k$, let

$$B_i = \frac{1}{\sum_j Q^*(a_j)\,e^{\lambda^*\rho_{ij}}}.$$

Then (b) and (c) imply that equations (3.19) and (3.20) in [6, p. 145] are satisfied with $s = -\lambda^*$, so by [6, Theorem 3.7] equation (3.18) is satisfied by $W^*$. This, together with Lemma 3.1 in [6, Chapter 2] implies that

$$R(D) = H(W^* \| P \times W^*_Y)$$

where $W^*_Y$ is the second marginal of $W^*$. But $W^*_Y = Q^*$, so $R(D) = E_P[H(W^*(\cdot|X)\|Q^*(\cdot))]$, and by the definition of $W^*$ and Proposition 1, $E_P[H(W^*(\cdot|X)\|Q^*(\cdot))] = R(P,Q^*,D)$.

Conversely, suppose $Q^*$ achieves the infimum in (9). Then by Lemma 1 there is a (unique) $\lambda^* < 0$ such that (a) holds, and letting $W^*$ be defined as in (b) we also have

$$R(D) \overset{(a)}{=} R(P,Q^*,D) \overset{(b)}{=} H(W^* \| P \times Q^*) \overset{(c)}{=} H(W^* \| P \times W^*_Y) + H(W^*_Y \| Q^*) \overset{(d)}{\ge} H(W^* \| P \times W^*_Y) \overset{(e)}{\ge} R(D),$$

where (a) follows by assumption; (b) from (10), Proposition 1 and the definition of $W^*$; (c) by the chain rule for relative entropy (see [5, Theorem 2.5.3]); (d) is because relative entropy is nonnegative; and (e) follows from the definition of $R(D)$ in (8). Therefore $H(W^*_Y \| Q^*) = 0$, implying (b). Finally note that the above argument shows that $W^*$ achieves $R(D)$. Then by Theorem 3.7 in [6, p. 145] $W^*$ satisfies equation (3.18) of [6, p. 145] with $s = -\lambda^*$, and by the uniqueness of the constants $B_i$ and equation (3.19) of [6, p. 145] we get (c). □

References

[1] L.V. Ahlfors, Complex Analysis, McGraw-Hill, New York, 1953.
[2] P.H. Algoet, Log-Optimal Investment, Ph.D. thesis, Dept. of Electrical Engineering, Stanford University, 1985.
[3] A.R. Barron, Logically Smooth Density Estimation, Ph.D. thesis, Dept. of Electrical Engineering, Stanford University, 1985.
[4] T. Berger, Rate Distortion Theory: A Mathematical Basis for Data Compression, Prentice-Hall Inc., Englewood Cliffs, NJ, 1971.
[5] T.M. Cover and J.A. Thomas, Elements of Information Theory, J. Wiley, New York, 1991.
[6] I. Csiszár and J. Körner, Information Theory: Coding Theorems for Discrete Memoryless Systems, Academic Press, New York, 1981.
[7] A. Dembo and O. Zeitouni, Large Deviations Techniques and Applications, Second Edition, Springer-Verlag, New York, 1998.
[8] J.C. Kieffer, "Sample converses in source coding theory," IEEE Trans. Inform. Theory, vol. 37, no. 2, pp. 263-268, 1991.
[9] I. Kontoyiannis, "Second-order noiseless source coding theorems," IEEE Trans. Inform. Theory, vol. 43, no. 4, pp. 1339-1341, July 1997.
[10] I. Kontoyiannis, "Pointwise redundancy in lossy data compression and universal lossy data compression," IEEE Trans. Inform. Theory, vol. 46, no. 1, pp. 136-152, January 2000.
[11] C.E. Shannon, "Coding theorems for a discrete source with a fidelity criterion," IRE Nat. Conv. Rec., part 4, pp. 142-163, 1959. Reprinted in D. Slepian (ed.), Key Papers in the Development of Information Theory, IEEE Press, 1974.